Building a network (Part 1)


Quick analysis and similarity network building

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd, numpy as np
import seaborn as sns
sns.set(style="whitegrid")
import sqlite3

Data

The data used here come from the Million Song Dataset:

  • SQLite database containing most metadata about each track (NEW VERSION 03/27/2011).
  • SQLite database linking artist ID to the tags (Echo Nest and musicbrainz ones).

Pandas can read data directly from the SQLite databases.
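As a minimal, self-contained sketch of this, here is `pd.read_sql` run against a throwaway in-memory SQLite database (the table name and rows are hypothetical, not the actual MSD files):

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory database mimicking the artist_mbtag table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist_mbtag (artist_id TEXT, mbtag TEXT)")
conn.executemany("INSERT INTO artist_mbtag VALUES (?, ?)",
                 [("AR1", "rock"), ("AR1", "french"), ("AR2", "rock")])
conn.commit()

# pandas turns the SQL result straight into a DataFrame.
df = pd.read_sql("SELECT * FROM artist_mbtag", conn)
print(df)
```

The same call works unchanged on the real `artist_term.db` connection opened below.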

Connect to both databases

In [2]:
conn_terms = sqlite3.connect("data/artist_term.db")
In [3]:
conn_artists = sqlite3.connect("data/track_metadata.db")

Get the mbtags

These are human-selected tags from musicbrainz.org. They cover a large variety of categories: countries, genres, languages.

In [4]:
mbtags = pd.read_sql("SELECT * FROM artist_mbtag",conn_terms)
In [5]:
cnt_art_tag = pd.pivot_table(mbtags,values="mbtag",index="artist_id",
                             aggfunc="count").reset_index().rename(columns={"mbtag":"countTags"})
In [6]:
print("{:,}".format(len(mbtags["artist_id"].unique())), "unique artists, and",
      "{:,}".format(len(mbtags["mbtag"].unique())), "unique tags;",
      "with an average of {0:0.2f}".format(cnt_art_tag["countTags"].mean()), "per artist")
In [7]:
cnt_art_tag.hist(bins=21)

The distribution of tag counts shows that most artists have only a few tags. It will be hard to build a similarity network from these!
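As a sanity check, the same per-artist counts can be tallied with a plain groupby instead of a pivot table (equivalent result; the mini-frame below is hypothetical toy data):

```python
import pandas as pd

# Toy frame in the same layout as mbtags: one row per (artist_id, mbtag).
mbtags = pd.DataFrame({"artist_id": ["A", "A", "A", "B"],
                       "mbtag":     ["rock", "pop", "french", "jazz"]})

# Count tags per artist: groupby + count is equivalent to the pivot_table above.
counts = mbtags.groupby("artist_id")["mbtag"].count()
print(counts)  # A: 3, B: 1
```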

Build a network

Source, target and edges

The data we use is a list of tuples (artist_id, mbtag) for 8,838 unique artists, with 1 to 20 tags each (average: 2.8).

We want to create an undirected network of artists who share common tags:

  • Sources and targets are the artists (we copy the same dataframe and rename the ids.)
  • Edges will be created by merging the source on the target, using the tags as a common field. Then we remove the edges where the source and targets are the same.
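The mechanics of this self-join can be seen on a hypothetical three-artist mini-frame:

```python
import pandas as pd

# Toy version of mbtags: A and B share "rock", B and C share "jazz".
mbtags = pd.DataFrame({"artist_id": ["A", "B", "B", "C"],
                       "mbtag":     ["rock", "rock", "jazz", "jazz"]})

# Same recipe as below: copy, rename, inner-merge on the tag, drop self-loops.
source = mbtags.rename(columns={"artist_id": "Source"})
target = mbtags.rename(columns={"artist_id": "Target"})
edges = source.merge(target, on="mbtag", how="inner")
edges = edges[edges["Source"] != edges["Target"]]
print(edges)
```

Note that the join yields both directions of each pair (A→B and B→A); since the network is undirected, these are duplicate edges.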
In [ ]:
source = mbtags.copy().rename(columns={"artist_id":"Source"})
target = mbtags.copy().rename(columns={"artist_id":"Target"})
edges = source.merge(target,on="mbtag",how="inner")
edges = edges[edges["Source"] != edges["Target"]]
In [ ]:
print("There are", "{:,}".format(len(source)), \
      "sources/targets connected by","{:,}".format(len(edges)), "links")
In [ ]:
mbtags["nb_tags"] = mbtags.groupby(["artist_id"])["mbtag"].transform("count")
In [ ]:
len(mbtags[mbtags["nb_tags"]>4]),len(mbtags[mbtags["nb_tags"]>2]),len(mbtags)

Remove redundant edges

This network has 8 million edges, many of them redundant, so we'll add a weight to each edge and drop the duplicates (keeping only one link per pair).

In [ ]:
edges["weight"] = edges.groupby(["Source","Target"])["mbtag"].transform("count")
In [ ]:
edges = edges[["Source","Target","weight"]].drop_duplicates()
In [ ]:
len(edges)
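On toy data (hypothetical), the weighting step above collapses duplicate Source/Target pairs into a single edge whose weight counts the shared tags:

```python
import pandas as pd

# A and B share two tags, B and C share one.
edges = pd.DataFrame({"Source": ["A", "A", "B"],
                      "Target": ["B", "B", "C"],
                      "mbtag":  ["rock", "pop", "rock"]})

# Weight = number of rows (shared tags) per (Source, Target) pair,
# then keep a single row per pair.
edges["weight"] = edges.groupby(["Source", "Target"])["mbtag"].transform("count")
edges = edges[["Source", "Target", "weight"]].drop_duplicates()
print(edges)  # A-B with weight 2, B-C with weight 1
```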

Add attributes

Artists names

The artist_ids are not very informative, so we use the track_metadata dataset to add the artists' names. Since it lists tracks, the (artist_id, artist_name) tuples are duplicated many times, so we'll drop the duplicates.

In [ ]:
tracks = pd.read_sql("SELECT artist_id,artist_name FROM songs",conn_artists)
In [ ]:
artists = tracks.drop_duplicates()
In [ ]:
print("There were", "{:,}".format(len(tracks)),"tracks, from", "{:,}".format(len(artists)), "artists")

If we look at the unique names of artists and ids, we realise that there is a discrepancy:

In [ ]:
print(len(artists.artist_id.unique()), "unique artists")
print(len(artists.artist_name.unique()), "unique artist names")

Argh, it seems that some artist_ids map to several names!

e.g. below for AROQS6O1187FB3D146

In [ ]:
artists[artists["artist_id"] == "AROQS6O1187FB3D146" ]

Looking at them, some are mistakes, but most are simply variations of the name, so we'll drop the duplicate artist_ids and keep only the first corresponding artist_name.

In [ ]:
artists = artists.drop_duplicates("artist_id")
In [ ]:
len(artists),len(artists.artist_id.unique()),len(artists.artist_name.unique())

Add Tags list

The tags are currently in one column, with one value per row. We want one row per artist, so we use a groupby with transform to join each artist's tags into a comma-separated string.
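The groupby-transform trick used below can be checked on a hypothetical two-artist mini-frame:

```python
import pandas as pd

# One row per (artist_id, mbtag), as in the real mbtags frame.
df = pd.DataFrame({"artist_id": ["A", "A", "B"],
                   "mbtag":     ["rock", "french", "jazz"]})

# transform broadcasts the joined string back onto every row of the group,
# so a later drop_duplicates("artist_id") leaves one row per artist.
df["Tags"] = df.groupby("artist_id")["mbtag"].transform(lambda x: ",".join(x))
print(df.drop_duplicates("artist_id")[["artist_id", "Tags"]])
```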

In [ ]:
mbtags["Tags"] = mbtags.groupby(['artist_id'])['mbtag'].transform(lambda x: ','.join(x))
In [ ]:
artists = artists.merge(mbtags.drop_duplicates("artist_id"),how="left")
In [ ]:
len(artists),artists.columns
In [ ]:
artists[artists["artist_id"] == "AR00GVV11C8A415A54"]

Add source names

In [ ]:
edges = edges.merge(artists.rename(columns={"artist_id":"Source"}),\
                      how="left").rename(columns={"Source":"Source_id","artist_name":"Source"})
In [ ]:
len(edges)
In [ ]:
edges.head()

Add target names

In [ ]:
edges = edges.merge(artists,left_on="Target",right_on="artist_id",how="left")
In [ ]:
edges = edges.rename(columns={"Target":"Target_id","artist_name":"Target"})[["Source","Target","Source_id",\
                                                                    "Target_id","weight"]]
In [ ]:
len(edges),edges.columns,len(edges[edges["weight"] > 3])

We keep only the strongest edges (weight greater than 3) and export them to CSV for use in Gephi.

In [ ]:
edges[edges["weight"] > 3][["Source","Target","weight","Source_id","Target_id"]].to_csv("edges.csv",index=None)

Get the nodes from the edges

In [ ]:
nodes = pd.DataFrame(edges[edges["weight"] > 3]["Source_id"].tolist() + \
                     edges[edges["weight"] > 3]["Target_id"].tolist(),\
                     columns=["artist_id"]).drop_duplicates()
In [ ]:
nodes = nodes.merge(mbtags[["artist_id","Tags"]],how="left").drop_duplicates()
In [ ]:
len(nodes),nodes.columns
In [ ]:
nodes = nodes.merge(artists,how="left")[["artist_name","Tags"]].rename(columns={"artist_name":"id"}).drop_duplicates()

We export to csv to use them in Gephi.

In [ ]:
nodes.to_csv("nodes.csv",index=None)